[data] randomize_block_order() not compatible with stage fusion #26090
Conversation
output.append(s)
output.extend(reorder_buf)
return output
Nice! This is a good one-off.
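The fragment above is part of a buffer-based reordering pass. As a rough, self-contained sketch of the idea (the Stage class and kind names here are illustrative stand-ins, not Ray's actual classes): randomize stages are held in a buffer and pushed past any run of one-to-one stages, flushing the buffer whenever a stage that cannot be reordered past (another all-to-all stage) is reached.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """Toy stand-in for a Dataset stage (not Ray's real class)."""
    name: str
    kind: str  # "randomize", "one_to_one", or "all_to_all"


def reorder_stages(stages):
    output = []
    reorder_buf = []  # randomize stages being deferred past 1-1 stages
    for s in stages:
        if s.kind == "randomize":
            # Defer the randomize stage past subsequent 1-1 stages so
            # read->map_batches fusion is not broken.
            reorder_buf.append(s)
        else:
            if s.kind == "all_to_all":
                # A randomize stage cannot be pushed past another
                # all-to-all stage, so flush the buffer before it.
                output.extend(reorder_buf)
                reorder_buf = []
            output.append(s)
    output.extend(reorder_buf)  # flush any trailing randomize stages
    return output
```

With this ordering, a pipeline like randomize -> map_batches -> repartition becomes map_batches -> randomize -> repartition, keeping the read and map_batches stages adjacent for fusion.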
)
return Dataset(plan, self._epoch, self._lazy)
plan = self._plan.with_stage(RandomizeBlocksStage(seed))
return Dataset(plan, self._epoch, self._lazy, defer_execution=True)
Hmm, having both lazy and defer_execution is a bit odd. Why can't this follow the self._lazy semantics? Is there an issue with executing this stage eagerly?
Good point. I can't remember why I did this... removed it.
Oh, it breaks lazy read->map_batches() fusion, that's why. I reverted this since it broke a unit test.
Hmm do you know why it breaks that?
@@ -329,6 +329,8 @@ def _optimize(self) -> Tuple[BlockList, DatasetStats, List[Stage]]:
         """
         context = DatasetContext.get_current()
         blocks, stats, stages = self._get_source_blocks_and_stages()
+        if context.optimize_reorder_stages:
+            stages = _reorder_stages(stages)
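The hunk above gates the reorder pass behind a context flag so the optimization can be toggled off. A minimal sketch of that gating, using a toy DatasetContext and a deliberately simplified stand-in for the reorder pass (all names here are illustrative, not Ray's actual API):

```python
from dataclasses import dataclass


@dataclass
class DatasetContext:
    """Toy stand-in for the global context carrying optimizer flags."""
    optimize_reorder_stages: bool = True


def _reorder_stages(stages):
    # Deliberately simplified: just move "randomize" entries after the
    # other stages; the real pass respects all-to-all barriers.
    return [s for s in stages if s != "randomize"] + [
        s for s in stages if s == "randomize"
    ]


def optimize(stages, context):
    # The flag lets users disable the reordering if it ever misbehaves.
    if context.optimize_reorder_stages:
        stages = _reorder_stages(stages)
    return stages
```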
@ericl If you rewrite the read stage before you reorder stages, won't the read->map_batches fusion work without the lazy vs defer_execution hack?
No, because if you call .randomize_blocks() eagerly it will materialize the read blocks, so it's too late. We have to force it to be lazy as a special case.
I also need this for auto_repartition() anyway, so I think this is a sensible thing... and if we move to lazy by default it will go away.
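The point above is that an eager Dataset normally executes its plan on construction, which would materialize the read blocks before the optimizer gets a chance to fuse read with map_batches. A minimal sketch of how a defer_execution flag sidesteps that (FakePlan and this Dataset are illustrative stand-ins, not Ray's real classes):

```python
class FakePlan:
    """Toy execution plan that records whether it was executed."""

    def __init__(self):
        self.executed = False

    def execute(self):
        self.executed = True


class Dataset:
    def __init__(self, plan, epoch, lazy, *, defer_execution=False):
        self._plan = plan
        self._epoch = epoch
        self._lazy = lazy
        if not lazy and not defer_execution:
            # Eager datasets normally execute immediately. Deferring keeps
            # the randomize stage un-materialized so the optimizer can
            # still reorder it past 1-1 stages and preserve fusion.
            self._plan.execute()
```

Under these assumptions, defer_execution=True makes a single eager construction behave lazily without flipping the dataset into fully lazy mode, which is why it reads as a special case that disappears once lazy becomes the default.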
@@ -169,6 +174,8 @@ def __init__(
         plan: ExecutionPlan,
         epoch: int,
         lazy: bool,
+        *,
+        defer_execution: bool = False,
I had a hard time understanding why there's a defer_execution when there is already a lazy parameter. It seems to muddy the execution semantics even more.
Why are these changes needed?
Per the discussion in #26057, fix the stage fusion issue by re-ordering the randomize stage past any 1-1 stages.
Closes #26057